{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pandas\n",
    "\n",
    "<div class=\"subtitle\">Process LMQL results as Pandas DataFrames.</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "#notebooks.js:hidden\n",
    "%load_ext autoreload\n",
    "%autoreload 2\n",
    "\n",
    "import sys \n",
    "sys.path.append(\"../../../../src/\")\n",
    "# load and set OPENAI_API_KEY\n",
    "import os \n",
    "from lmql.runtime.openai_secret import openai_secret\n",
    "os.environ[\"OPENAI_API_KEY\"] = openai_secret"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When used from within Python, LMQL queries can be treated as simple python functions. This means building pipelines with LMQL is as easy as chaining together functions. However, next to easy integration, LMQL queries also offer a guaranteed output format when it comes to the data types and structure of the returned values. This makes it easy to process the output of LMQL queries in a structured way, e.g. with tabular/array-based data processing libraries such as Pandas.\n",
    "\n",
    "## Tabular Data Generation\n",
    "\n",
    "For example, to produce tabular data with LMQL, the following code snippet processes the output of an LMQL query with [pandas](https://pandas.pydata.org):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "     NAME  AGE                BREED  \\\n",
       "0   Storm    2     Golden Retriever   \n",
       "1            2     Golden Retriever   \n",
       "2   Lucky    3     Golden Retriever   \n",
       "3   Rocky    4   Labrador Retriever   \n",
       "4     Rex   11     Golden Retriever   \n",
       "5   Murky    5       Cocker Spaniel   \n",
       "6   Gizmo    5               Poodle   \n",
       "7   Bubba    3              Bulldog   \n",
       "\n",
       "                                              MOVE  \n",
       "0   Spinning in circles while chasing its own tail  \n",
       "1             Wiggles its entire body when excited  \n",
       "2                       Wiggles butt while walking  \n",
       "3                         Loves to chase squirrels  \n",
       "4                            Barks at anything red  \n",
       "5                                      Wiggle butt  \n",
       "6                              Spinning in circles  \n",
       "7                                                   "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#notebooks.js:show_stdout=false\n",
    "import lmql\n",
    "import pandas as pd\n",
    "\n",
    "@lmql.query\n",
    "async def generate_dogs(n):\n",
    "    '''lmql\n",
    "    sample(temperature=1.0, n=n)\n",
    "    \n",
    "    \"\"\"Generate a dog with the following characteristics:\n",
    "    Name:[NAME]\n",
    "    Age: [AGE]\n",
    "    Breed:[BREED]\n",
    "    Quirky Move:[MOVE]\n",
    "    \"\"\" where STOPS_BEFORE(NAME, \"\\n\") and STOPS_BEFORE(BREED, \"\\n\") and \\\n",
    "              STOPS_BEFORE(MOVE, \"\\n\") and INT(AGE) and len(TOKENS(AGE)) < 3\n",
    "    '''\n",
    "\n",
    "result = await generate_dogs(8)\n",
    "df = pd.DataFrame([r.variables for r in result])\n",
    "df"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note how we sample multiple sequences (i.e. dog instances) using the `sample(temperature=1.0, n=n)` decoder statement. \n",
    "\n",
    "The returned `result` is a list of [`lmql.LMQLResult`](../python.ipynb)), which we can easily convert to a `pandas.DataFrame` by accessing `r.variables` on each item.\n",
    "\n",
    "In the query, we use [scripted prompting](../../language/scripted-prompting.md) and [constraints](../../language/constraints.md) to make sure the generated dog samples are valid and each provides the attributes name, age, breed, and move.\n",
    "\n",
    "## Data Processing\n",
    "\n",
    "Converting the resulting values for LMQL template variables `NAME`, `AGE`, `BREED`, and `MOVE` to a `pandas.DataFrame`, makes it easy to apply\n",
    "further processing and work with the generated data. \n",
    "\n",
    "For instance, we can easily determine the average age of the generated dogs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6.0"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# determine average age\n",
    "df[\"AGE\"].mean()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note how the `INT(AGE)` constraints automatically converted the `AGE` values to integers, which makes the generated `AGE` values automatically amendable to arithmetic operations such as `mean()`."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Based on this tabular representation, it is now also easy to filter and aggregate the data. For instance, we can easily determine the average age of the generated dogs per breed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BREED\n",
       " Corgi                  5.00\n",
       " Golden Retriever      14.75\n",
       " Labrador Retriever     5.00\n",
       " Poodle                 2.00\n",
       "Name: AGE, dtype: float64"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# group df by BREED and compute the average age for each breed\n",
    "df.groupby(\"BREED\")[\"AGE\"].mean()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "lmql",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
